Clustering header categories extracted from web tables

Authors

  • George Nagy
  • David W. Embley
  • Mukkai S. Krishnamoorthy
  • Sharad C. Seth
Abstract

Revealing related content among heterogeneous web tables is part of our long-term objective of formulating queries over multiple sources of information. Two hundred HTML tables from institutional web sites are segmented, and each table cell is classified according to the fundamental indexing property of row and column headers. The categories that correspond to the multi-dimensional data cube view of a table are extracted by factoring the (often multi-row or multi-column) headers. To reveal commonalities between tables from diverse sources, the Jaccard distances between pairs of category headers (and also between table titles) are computed. We show how about one-third of our heterogeneous collection can be clustered into a dozen groups that exhibit table-title and header similarities that can be exploited for queries.
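The Jaccard distance mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how headers are turned into sets, so the lowercase word-token split below is an assumption, and the helper names `tokenize` and `jaccard_distance` are hypothetical, not taken from the paper.

```python
import re


def tokenize(header: str) -> set[str]:
    """Split a header string into a set of lowercase word tokens.

    Assumption: word-level tokens; the paper may use a different unit.
    """
    return set(re.findall(r"[a-z0-9]+", header.lower()))


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Return 1 - |a & b| / |a | b|, treating two empty sets as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Illustrative header strings, invented for this example.
h1 = tokenize("Year, Region")
h2 = tokenize("region year")
h3 = tokenize("Year, Gender")

d_same = jaccard_distance(h1, h2)     # identical token sets
d_partial = jaccard_distance(h1, h3)  # one shared token out of three
```

A distance of 0 means the two headers share all tokens and a distance of 1 means they share none, so pairs with small distances are candidates for the same cluster.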

Related resources

A Grammar Based Analysis of Column Header Categories for Web Tables

As part of a project to harvest semi-structured data from web tables, we describe an approach to extract an abstract representation of the column-header categories based on a context-free grammar for linear strings. The column-header structure is generally an XY-tessellation. The grammar provides a compact representation of infinitely many structural variations possible within column headers. B...

Recovering Semantics of Tables on the Web

The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching the table with additional annotations. Our annotations facilitate operations such as sear...

Crossed Clustering method on Symbolic Data tables

In this paper we propose a crossed clustering algorithm in order to partition a set of symbolic objects into a predefined number of classes and to determine, at the same time, a structure (taxonomy) on the categories of the object descriptors. The procedure is an extension of the classical simultaneous clustering algorithms proposed on binary and contingency tables. Our approach is based on a dyn...

A Large Public Corpus of Web Tables containing Time and Context Metadata

The Web contains vast amounts of HTML tables. Most of these tables are used for layout purposes, but a small subset of the tables is relational, meaning that they contain structured data describing a set of entities [2]. As these relational Web tables cover a very wide range of different topics, there is a growing body of research investigating the utility of Web table data for completing cross...

Markup-Agnostic Table Cell Extraction

Tables are very commonly used to present relational data. This report focuses on mining structured data from markup language specified tables. Table recognition, table interpretation and presentation of results are discussed. First, two categories of features are developed to recognize genuine tables. These recognized tables provide knowledge of table types we need in order to synthesize tables ...

Publication date: 2015